Add debug instrumentation for `test_play_services` #1013

Blast545 · 2022-05-12T21:36:01Z

Not a fix, but this PR should help with the test_play_services flaky tests described here: #862

Follow up discussion: #862 (comment)

Only reduced the number of msgs to 190 to confirm that rmw_fastrtps fails with specifically 195 iterations.
Only increased the timeout by 2 to avoid increasing the test time too much.

To test the PR I'll run two ci_linux jobs testing rosbag2_transport with the repeated jobs flags.

Signed-off-by: Jorge Perez [email protected]

Signed-off-by: Jorge Perez <[email protected]>

Blast545 · 2022-05-12T21:55:11Z

Testing with master + repeated flags on linux:

Rebuild:

Testing with this PR + repeated flags on linux:

Rebuild:

(adding a second round of CI because the error is flaky)

jhdcs · 2022-05-13T13:00:30Z

Perhaps I'm missing something, but this would just be masking an underlying problem that a test is identifying, right? Are we sure that we want that?

Sure, the issue may be from a separate repository, but this would be useful for providing data to the appropriate people in order to get the problem fixed...

clalancette · 2022-05-13T13:08:34Z

Perhaps I'm missing something, but this would just be masking an underlying problem that a test is identifying, right? Are we sure that we want that?

I think we should do both. This test has been failing for quite a long time, and it makes the lives of all ROS 2 developers difficult. So working around it is the right thing for the short-term, while a bug report to rmw_fastrtps is the right thing for the long term.

Blast545 · 2022-05-13T17:55:40Z

Perhaps I'm missing something, but this would just be masking an underlying problem that a test is identifying, right? Are we sure that we want that?

@jhdcs I'm not sure if this is what we want, I opened the PR with the intention of gathering more info from CI and receiving feedback from maintainers + ROS2 team as to how to proceed.

This flaky error is affecting 50% of our https://build.ros2.org/view/Rci/job/Rci__nightly-release_ubuntu_jammy_amd64/ builds atm, and hasn't been addressed in a long time. We should do something.

jhdcs · 2022-05-13T18:19:15Z

Perhaps I'm missing something, but this would just be masking an underlying problem that a test is identifying, right? Are we sure that we want that?

@jhdcs I'm not sure if this is what we want, I opened the PR with the intention of gathering more info from CI and receiving feedback from maintainers + ROS2 team as to how to proceed.

This flaky error is affecting 50% of our https://build.ros2.org/view/Rci/job/Rci__nightly-release_ubuntu_jammy_amd64/ builds atm, and hasn't been addressed in a long time. We should do something.

I agree that something should be done. And thank you for stepping up to the plate. I just don't want this PR to be used as an excuse for not fixing the underlying problem - which doesn't appear to be an issue here.

MichaelOrlov · 2022-05-13T18:33:44Z

@Blast545 @jhdcs @clalancette I would suggest some compromise solution.
I think at the moment it will be reasonable to try to increase wait timeout for the service calls and add some more debug information in tests.
I have tried to analyze this issue in more details last night comparing with other failing cases. I've traced all the pass till when we are sending request to via DDS data writer. And this path looks good for me. What I have not verified is the way how we are getting confirmation about response from service call.

What I've found is that mostly failing two tests PlaySrvsTest.toggle_paused and PlaySrvsTest.play_next. and mostly with rmw_fastrtps_dynamic_cpp. However I've seen the same failures with CycloneDDS.
It telling us that most likely the problem unrelated to the some specific DDS implementation or we incorrectly rely on some assumption which is true for both DDS implementations.

I have suspicious that there are some problem with timings or QoS. Although not sure in what place exactly. The higher failure rate with rmw_fastrtps_dynamic_cpp could be just because it's a bit slowly I guess.

I've tried to think about what is common in those two PlaySrvsTest.toggle_paused and PlaySrvsTest.play_next tests to cause a failure.
From Rosbag2::Player perspective of view almost nothing. PlaySrvsTest.play_next could take longer in response since it invoke publishing messages inside. However PlaySrvsTest.toggle_paused should be pretty fast there are almost nothing inside on Player side just lock for one mutex and toggling a flag.
I think that common in those two tests is the returning boolean value in play_next returning result true or false.
Toggle paused returning void. Although in test after each call for toggle_pause we are requesting status via another service call is_paused
https://github.com/ros2/rosbag2/blob/0.16.0/rosbag2_transport/test/rosbag2_transport/test_play_services.cpp#L116-L120
https://github.com/ros2/rosbag2/blob/0.16.0/rosbag2_transport/test/rosbag2_transport/test_play_services.cpp#L273
which is also returning boolean value.

I have a suspicious that failure happening for cases when service call has returning value.

jhdcs · 2022-05-13T18:39:25Z

@Blast545 @jhdcs @clalancette I would suggest some compromise solution.
I think at the moment it will be reasonable to try to increase wait timeout for the service calls and add some more debug information in tests.

This feels very reasonable to me. Just in case it's not clear, my only concern is if the underlying issue is never addressed. As long as we're making concrete steps to identify and correct the issue, creating some stopgap measures is perfectly fine!

From the rest of your post... It almost sounds like we may have a race condition going on... Does anyone else get that feeling?

- Increased service_call_timeout_ up to 3 seconds. - Split `toggle_paused` on two separate tests `is_paused` and `toggle_paused` - Add output of the function name and line number in case of failure. Signed-off-by: Michael Orlov <[email protected]>

MichaelOrlov · 2022-05-13T20:46:07Z

@Blast545 I pushed commit with additional debug info.
Could you please try to re-run CI?

Blast545 · 2022-05-13T20:55:42Z

@MichaelOrlov Sure thing!

Repeated + PR on Linux

From the initial CI that I ran, I see that the second repeated + PR failed with rmw_fastrtps, potentially invalidating the theory of the 195 meaning something internally for fastrtps.
https://ci.ros2.org/job/ci_linux/16705/

MichaelOrlov · 2022-05-13T22:56:48Z

From the initial CI that I ran, I see that the second repeated + PR failed with rmw_fastrtps, potentially invalidating the theory of the 195 meaning something internally for fastrtps

Totally agree.

MichaelOrlov · 2022-05-13T23:20:46Z

@Blast545 What I can see is that test running twice per build and each time it fails on the same step or iteration.
The iteration could vary from build to build but it's very consistent between tests run from the same build.
For example in one build it fails twice in play_next on 95 iteration https://ci.ros2.org/job/ci_linux/16709/testReport/junit/(root)/projectroot/test_play_services__rmw_cyclonedds_cpp/ on another build it fails twice on 118 iteration https://ci.ros2.org/job/ci_linux/16709/testReport/junit/(root)/projectroot/test_play_services__rmw_fastrtps_cpp/ and I have seen the same behavior for toggle_paused.
And please note that it fails on both CycloneDDS and FastRTPS.

It's actually usually pointing out that we have somewhere a dangling pointer or dangling reference which is getting overwritten at some iteration depending from the current memory layout.

I am unfortunately don't have more time to debug this issue deeper. And I am not very familiar with implementation of the service calls and SingleThreadedExecutor. We need to find someone who can continue analysis of this issue from this point.

@wjwwood Could you please help with debugging failure in service calls? Or please advise someone who can help with it.

MichaelOrlov

@Blast545 @clalancette I would suggest to merge this PR with added debug instrumentation and increased wait time for service call.

I would like to see if it will still fail on CI in toggle_paused or in new is_paused test.

wjwwood · 2022-05-14T03:59:07Z

@wjwwood Could you please help with debugging failure in service calls? Or please advise someone who can help with it.

Yeah, I can help. It might be Monday though before I can catch up on the thread. I'd be happy to chat out of band on Monday as well.

Signed-off-by: Jorge Perez <[email protected]>

Blast545 · 2022-05-16T18:47:36Z

I changed the title of this PR to better reflect the debug info added in @MichaelOrlov 's commit.

Also restored the msgs_to_publish from 190 to 200 as the number 195 wasn't representative as we originally thought.

Blast545 · 2022-05-16T18:53:59Z

Full CI: --packages-up-to rosbag2_transport + packages-select rosbag2_transport

Linux
Linux-aarch64
Windows

Add redefinition for __PRETTY_FUNCTION__ to __FUNCSIG__ if it is not defined and not gcc compiler. Signed-off-by: Michael Orlov <[email protected]>

MichaelOrlov · 2022-05-16T20:11:39Z

@Blast545 Sorry I forgot that __PRETTY_FUNCTION__ doesn't exist on Windows. This is why CI failed on windows.
I've pushed workaround. Will see if it will fix the windows build.

Blast545 · 2022-05-16T20:40:36Z

Silly me, I ran the launcher with a wrong configuration. Re running:

Linux
Linux-aarch64
Windows

Increase test timeout and decrease msgs sent

105d4da

Signed-off-by: Jorge Perez <[email protected]>

Blast545 requested a review from a team as a code owner May 12, 2022 21:36

Blast545 requested review from gbiggs and jhdcs and removed request for a team May 12, 2022 21:36

MichaelOrlov self-requested a review May 13, 2022 23:58

MichaelOrlov approved these changes May 14, 2022

View reviewed changes

Restore msgs_to_publish

d5a86dc

Signed-off-by: Jorge Perez <[email protected]>

Blast545 changed the title ~~Increase test timeout and decrease msgs sent~~ Add debug instrumentation for test_play_services May 16, 2022

Fix for windows build

7b030fa

Add redefinition for __PRETTY_FUNCTION__ to __FUNCSIG__ if it is not defined and not gcc compiler. Signed-off-by: Michael Orlov <[email protected]>

MichaelOrlov merged commit e140329 into master May 16, 2022

delete-merged-branch bot deleted the blast545/test_play_services_improvement branch May 16, 2022 23:55

MichaelOrlov mentioned this pull request May 25, 2022

Reduce message spam when topics to be recorded do not exist #1018

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add debug instrumentation for `test_play_services` #1013

Add debug instrumentation for `test_play_services` #1013

Blast545 commented May 12, 2022 •

edited

Loading

Blast545 commented May 12, 2022 •

edited

Loading

jhdcs commented May 13, 2022

clalancette commented May 13, 2022

Blast545 commented May 13, 2022

jhdcs commented May 13, 2022

MichaelOrlov commented May 13, 2022

jhdcs commented May 13, 2022

MichaelOrlov commented May 13, 2022

Blast545 commented May 13, 2022

MichaelOrlov commented May 13, 2022

MichaelOrlov commented May 13, 2022 •

edited

Loading

MichaelOrlov left a comment

wjwwood commented May 14, 2022

Blast545 commented May 16, 2022

Blast545 commented May 16, 2022

MichaelOrlov commented May 16, 2022

Blast545 commented May 16, 2022 •

edited

Loading

Add debug instrumentation for test_play_services #1013

Add debug instrumentation for test_play_services #1013

Conversation

Blast545 commented May 12, 2022 • edited Loading

Blast545 commented May 12, 2022 • edited Loading

jhdcs commented May 13, 2022

clalancette commented May 13, 2022

Blast545 commented May 13, 2022

jhdcs commented May 13, 2022

MichaelOrlov commented May 13, 2022

jhdcs commented May 13, 2022

MichaelOrlov commented May 13, 2022

Blast545 commented May 13, 2022

MichaelOrlov commented May 13, 2022

MichaelOrlov commented May 13, 2022 • edited Loading

MichaelOrlov left a comment

Choose a reason for hiding this comment

wjwwood commented May 14, 2022

Blast545 commented May 16, 2022

Blast545 commented May 16, 2022

MichaelOrlov commented May 16, 2022

Blast545 commented May 16, 2022 • edited Loading

Add debug instrumentation for `test_play_services` #1013

Add debug instrumentation for `test_play_services` #1013

Blast545 commented May 12, 2022 •

edited

Loading

Blast545 commented May 12, 2022 •

edited

Loading

MichaelOrlov commented May 13, 2022 •

edited

Loading

Blast545 commented May 16, 2022 •

edited

Loading